Abstract


We establish minimax optimal rates of convergence for estimation in a high
dimensional additive model assuming that it is approximately sparse. Our results reveal
an interesting phase transition behavior universal to this class of high dimensional 
problems. In the sparse regime when the components are sufficiently smooth or the 
dimensionality is sufficiently large, the optimal rates are identical to those for high 
dimensional linear regression, and therefore there is no additional cost to entertain a 
nonparametric model. Otherwise, in the so-called smooth regime, the rates coincide 
with the optimal rates for estimating a univariate function, and therefore they are 
immune to the “curse of dimensionality”. 

Key words: Convergence rate, method of regularization, minimax optimality, phase
transition, reproducing kernel Hilbert space, Sobolev space.
1 Introduction


With the recent advances in science and technology, high dimensional regression problems
have become ubiquitous in a multitude of areas; genomics, medical imaging, and finance
are a few well known examples. A considerable amount of research effort has been devoted to
understanding the challenges brought about by high dimensionality, and to developing
statistical methodology to counter them. Most of the existing work focuses on high
dimensional linear regression, where a number of approaches such as the nonnegative garrote
(Breiman, 1995), the Lasso (Tibshirani, 1996), the SCAD (Fan and Li, 2001), and the
Dantzig selector (Candes and Tao, 2007) have been developed to exploit sparsity or perform
variable selection. Much progress has also been made in understanding to what extent a
high dimensional regression coefficient vector can be reliably estimated; see, e.g., Koltchinskii
(2011), Bühlmann and van de Geer (2013) and references therein.

Linear models, however, could be too restrictive in many applications. As a more flexible 
alternative, high dimensional additive models have attracted much attention in the past 
several years. See, e.g., Lin and Zhang (2006), Yuan (2007), Koltchinskii and Yuan (2008), 
Ravikumar et al. (2008), Meier, van de Geer and Bühlmann (2009), Koltchinskii and Yuan
(2010) and Raskutti, Wainwright and Yu (2012) among others. Let $\{(X_i, Y_i) : i = 1, \ldots, n\}$
be independent copies of a random couple $(X, Y)$ following a regression model:

$$Y = f(X) + \epsilon, \qquad (1)$$

where the error $\epsilon$ follows a $N(0, \sigma^2)$ distribution. The additive model amounts to the
assumption that

$$f(x_1, \ldots, x_d) = f_1(x_1) + \cdots + f_d(x_d), \qquad (2)$$

where the component functions $f_j$ are modeled nonparametrically; see, e.g., Stone (1985)
or Hastie and Tibshirani (1990). Here we assume that they reside in certain reproducing
kernel Hilbert spaces (RKHS); see, e.g., Aronszajn (1950) and Wahba (1990).

To fix ideas, assume that $X$ follows a distribution $\Pi$ supported on a product space $\mathcal{X}^d$
for some compact subset $\mathcal{X}$ of $\mathbb{R}$; and that all component functions come from a common
RKHS of functions on $\mathcal{X}$, denoted by $(\mathcal{H}_1, \|\cdot\|_{\mathcal{H}_1})$. It is clear that the additive model (2)
can be identified with the space

$$\mathcal{H}_d := \mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_1 = \big\{ f : \mathcal{X}^d \to \mathbb{R} \;\big|\; f(x_1, \ldots, x_d) = g_1(x_1) + \cdots + g_d(x_d) \text{ and } g_1, \ldots, g_d \in \mathcal{H}_1 \big\}.$$

Obviously linear models can be viewed as a trivial special case of (2) by taking $\mathcal{H}_1$ to be the
collection of all univariate linear functions defined over $\mathcal{X}$. Another canonical example of $\mathcal{H}_1$
is the $\alpha$th ($\alpha > 1/2$) order Sobolev space $W_2^\alpha([0,1])$ defined on the unit interval ($\mathcal{X} = [0,1]$).
See, e.g., Wahba (1990) for further examples.

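We note that for a general $g \in \mathcal{H}_d$, the additive representation given by (2) may not be
unique. Define the (quasi-)norm $\|g\|_{\ell_q(\mathcal{H}_d)}$ ($q > 0$) by

$$\|g\|_{\ell_q(\mathcal{H}_d)} = \inf\Big\{ \big\| (\|g_1\|_{\mathcal{H}_1}, \ldots, \|g_d\|_{\mathcal{H}_1})^\top \big\|_{\ell_q} : g_1(x_1) + \cdots + g_d(x_d) = g(x_1, \ldots, x_d) \text{ and } g_1, \ldots, g_d \in \mathcal{H}_1 \Big\}.$$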

In other words, $\|g\|_{\ell_q(\mathcal{H}_d)}$ is the $\ell_q$ norm of the vector of RKHS norms of its component
functions minimized over all of its additive representations. In particular, when $q = 2$,
$\|\cdot\|_{\ell_2(\mathcal{H}_d)}$ can be viewed as an RKHS norm. More specifically, let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a Mercer
kernel generating the RKHS $(\mathcal{H}_1, \|\cdot\|_{\mathcal{H}_1})$ and write

$$K_d\big((x_1, \ldots, x_d)^\top, (x_1', \ldots, x_d')^\top\big) = K(x_1, x_1') + \cdots + K(x_d, x_d').$$

It is not hard to see that $K_d$ is the generating kernel of the RKHS $(\mathcal{H}_d, \|\cdot\|_{\ell_2(\mathcal{H}_d)})$. Another
special case of the $\ell_q(\mathcal{H}_d)$ norm defined above is the case when $q = 0$: $\|\cdot\|_{\ell_0(\mathcal{H}_d)}$ can be
interpreted as the smallest number of additive components needed to express a function
from $\mathcal{H}_d$.
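The additive kernel construction can be illustrated with a few lines of Python. This is only a sketch: the Gaussian base kernel and its bandwidth are assumptions chosen for illustration (the text works with a general Mercer kernel, and Sobolev spaces as the canonical example), and the function names are hypothetical.

```python
import math


def gaussian_kernel(x, y, bandwidth=0.5):
    """Illustrative univariate kernel (an assumption); note K(x, x) = 1."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def additive_kernel(x, y, base_kernel=gaussian_kernel):
    """Additive kernel: sum of the univariate kernel over the d coordinates."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same dimension")
    return sum(base_kernel(a, b) for a, b in zip(x, y))


x = [0.1, 0.5, 0.9]
y = [0.2, 0.5, 0.4]
value = additive_kernel(x, y)
print(value)
```

Because each coordinate contributes its own univariate kernel, the kernel of the $d$-variate additive space costs only $d$ univariate evaluations.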

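When the dimension $d$ is large, it is of particular interest to consider the case when $f$
resides in an $\ell_q(\mathcal{H}_d)$ ball for $0 < q < 1$:

$$B_R(\ell_q(\mathcal{H}_d)) = \{ g \in \mathcal{H}_d : \|g\|_{\ell_q(\mathcal{H}_d)} \le R \}.$$

Write

$$\|g\|_{L_2(\Pi)} = \Big( \int_{\mathcal{X}^d} g^2(x)\, d\Pi(x) \Big)^{1/2}.$$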
We are interested in the minimax optimal rate of convergence for estimating $f$ in terms of
the squared $\|\cdot\|_{L_2(\Pi)}$ norm. In particular, when $\mathcal{H}_1$ is taken to be the $\alpha$th order Sobolev
space $W_2^\alpha$ defined on the unit interval, our results imply that the minimax optimal rate for
estimating $f \in B_R(\ell_q(\mathcal{H}_d))$ is given by

$$\Big( \frac{\log d}{n} \Big)^{1 - q/2} + n^{-\frac{2\alpha}{2\alpha+1}}, \qquad (3)$$

up to a constant scaling factor. The optimal rate of convergence given by (3) exhibits an 
interesting phase transition phenomenon as illustrated in Figure 1. 



Figure 1: Phase transition in optimal rates of convergence: When the smoothness index
$\alpha$ and the dimensionality measured by $\log\log d / \log n$ fall in the smooth region in the figure
above, the optimal rate is given by $n^{-2\alpha/(2\alpha+1)}$, which is determined solely by the smoothness
index. On the other hand, if they fall into the sparse regime, then the optimal rate is given
by $(n^{-1} \log d)^{1-q/2}$, which is determined entirely by the dimensionality.


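More specifically, when the component functions are not sufficiently smooth in that

$$\alpha < \frac{1}{q} - \frac{1}{2},$$

the second term on the right hand side of (3) is dominated by the first one if $d$ is ultra-large:

$$d > \exp\Big( n^{\frac{2}{2-q}\left(\frac{1}{2\alpha+1} - \frac{q}{2}\right)} \Big),$$

and hence the minimax optimal rate becomes

$$\mathcal{R}(n, d) \asymp \Big( \frac{\log d}{n} \Big)^{1-q/2}, \qquad (4)$$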

where we write for two positive sequences $a_{n,d}$ and $b_{n,d}$, $a_{n,d} \asymp b_{n,d}$ if $a_{n,d}/b_{n,d}$ is bounded
away from both zero and infinity. The rate given by (4) happens to be the minimax optimal
rate for estimating a $d$ dimensional linear regression when assuming that the vector of
regression coefficients comes from an $\ell_q$ ball in $\mathbb{R}^d$; see, e.g., Ye and Zhang (2010) or Raskutti,
Wainwright and Yu (2011). On the other hand, when

$$d < \exp\Big( n^{\frac{2}{2-q}\left(\frac{1}{2\alpha+1} - \frac{q}{2}\right)} \Big),$$

the optimal rate is given by

$$\mathcal{R}(n, d) \asymp n^{-\frac{2\alpha}{2\alpha+1}}.$$


This rate coincides with the optimal rate for estimating $f$ if we know in advance that it
actually comes from a single component space $\mathcal{H}_1$, e.g., $f_2 = \cdots = f_d = 0$, rather than
the $d$-variate function space $\mathcal{H}_d$; see, e.g., Stone (1980, 1982) and Tsybakov (2009). A similar
phase transition depending on the dimensionality $d$ has also been observed earlier for high
dimensional additive models under exact sparsity ($q = 0$). See, e.g., Koltchinskii and Yuan
(2010), Raskutti, Wainwright and Yu (2012) and Suzuki and Sugiyama (2013). Our results
suggest that such a phenomenon is more universal and applies in general to the approximately
sparse case.
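The threshold on $d$ that separates the two regimes can be recovered by comparing the two terms of (3) directly. The following short derivation is a sketch of that comparison; the shorthand $T_{\mathrm{sparse}}$ and $T_{\mathrm{smooth}}$ is introduced here only for illustration.

```latex
\begin{align*}
T_{\mathrm{sparse}} := \Big(\frac{\log d}{n}\Big)^{1-q/2}
  \;\ge\; n^{-\frac{2\alpha}{2\alpha+1}} =: T_{\mathrm{smooth}}
&\iff \frac{\log d}{n} \ge n^{-\frac{4\alpha}{(2\alpha+1)(2-q)}} \\
&\iff \log d \ge n^{\,1-\frac{4\alpha}{(2\alpha+1)(2-q)}}
   = n^{\frac{2}{2-q}\left(\frac{1}{2\alpha+1}-\frac{q}{2}\right)},
\end{align*}
% since 1 - 4\alpha/((2\alpha+1)(2-q)) = (2 - q(2\alpha+1))/((2\alpha+1)(2-q)).
```

The exponent on the right is positive exactly when $\alpha < 1/q - 1/2$. Otherwise the right hand side is at most one, so the first term dominates as soon as $\log d \ge 1$.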

It is also worth pointing out that such a phase transition in $d$ vanishes when the component
functions are sufficiently smooth in that

$$\alpha > \frac{1}{q} - \frac{1}{2},$$

a phenomenon absent in the case of exact sparsity ($q = 0$). In this situation, the second
term on the right hand side of (3) is always dominated by the first one and therefore the
optimal rate is always

$$\mathcal{R}(n, d) \asymp \Big( \frac{\log d}{n} \Big)^{1-q/2}.$$
In other words, we pay no extra price, in terms of rates of convergence, for entertaining a
generally nonparametric additive model (2) when compared with the much more restrictive 
linear models, regardless of the value of d. 
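The regimes described above can be explored numerically. The sketch below evaluates the two terms of (3) and reports which one dominates; it ignores all constants, so it only illustrates the orders of magnitude, and the function names are hypothetical.

```python
import math


def rate_terms(n, d, q, alpha):
    """Return the sparse term and the smooth term of the rate (3), constants ignored."""
    sparse = (math.log(d) / n) ** (1.0 - q / 2.0)
    smooth = n ** (-2.0 * alpha / (2.0 * alpha + 1.0))
    return sparse, smooth


def log_d_threshold(n, q, alpha):
    """Value of log d at which the two terms of (3) are equal."""
    exponent = (2.0 / (2.0 - q)) * (1.0 / (2.0 * alpha + 1.0) - q / 2.0)
    return n ** exponent


def regime(n, d, q, alpha):
    """Name the regime by the dominating term of (3)."""
    sparse, smooth = rate_terms(n, d, q, alpha)
    return "sparse" if sparse >= smooth else "smooth"


# alpha = 1 < 1/q - 1/2 = 1.5, so a phase transition in d exists.
print(regime(1000, 5, 0.5, 1.0), regime(1000, 10**6, 0.5, 1.0))
# alpha = 3 > 1/q - 1/2 = 1.5, so the sparse term dominates even for small d.
print(regime(1000, 3, 0.5, 3.0))
```

Comparing `math.log(d)` with `log_d_threshold(n, q, alpha)` gives the same classification as `regime`, which is just the displayed condition on $d$ written on the log scale.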

Although we focus on additive models, our general framework is also closely related to 
multiple kernel learning or “aggregation” of kernel machines, a popular technique in machine 
learning to combine multiple kernels instead of using a single one in order to achieve improved 
prediction performance. These types of problems have been studied recently by Bousquet et
al. (2003), Cramer et al. (2003), Lanckriet et al. (2004), Micchelli and Pontil (2005), Srebro
and Ben-David (2006), Bach (2008), and Suzuki and Sugiyama (2013) among others. It is 
expected that our results here could lead to further understanding of these problems as well. 

The rest of the paper is organized as follows. We first review some basic concepts and 
properties of reproducing kernel Hilbert spaces in Section 2. Section 3 presents the main 
results. All proofs are relegated to Section 4. 

2 Reproducing Kernel Hilbert Spaces 

We begin with a brief review of some of the basic facts about RKHS of which we shall make
repeated use later on. Interested readers are referred to Aronszajn (1950) and Wahba (1990)
for further details. In particular, we shall focus on the $j$th component space, i.e., the RKHS
defined on the $j$th coordinate of $X \in \mathcal{X}^d$.

2.1 Kernel and RKHS 

Recall that $K$ is a symmetric positive semi-definite, square integrable function on $\mathcal{X} \times \mathcal{X}$.
It can be uniquely identified with the Hilbert space $\mathcal{H}_1$ that is the completion of

$$\{K(x, \cdot) : x \in \mathcal{X}\}$$

under the inner product

$$\Big\langle \sum_i c_i K(x_i, \cdot), \sum_j d_j K(x_j', \cdot) \Big\rangle_K = \sum_{i,j} c_i d_j K(x_i, x_j').$$

In the rest of the section, we shall write $\mathcal{H}_1$ and $\mathcal{H}(K)$ interchangeably, with the latter notation
emphasizing the one-to-one correspondence between a kernel and an RKHS. Most, if not all,
the commonly used kernels are bounded, which we shall assume in what follows. In fact,
without loss of generality, we shall assume in the rest of the paper that $\sup_{x \in \mathcal{X}} K(x, x) = 1$.
Note that, for any $h \in \mathcal{H}(K)$,

$$\|h\|_\infty := \sup_{x \in \mathcal{X}} |h(x)| = \sup_{x \in \mathcal{X}} |\langle h, K(x, \cdot) \rangle_K| \le \sup_{x \in \mathcal{X}} \|K(x, \cdot)\|_K \|h\|_K, \qquad (5)$$

by the Cauchy-Schwarz inequality. Recall that

$$\|K(x, \cdot)\|_K^2 = \langle K(x, \cdot), K(x, \cdot) \rangle_K = K(x, x) \le 1.$$

Thus,

$$\|h\|_\infty \le \|h\|_K,$$

a convenient fact that we shall use repeatedly in the later analysis.

By spectral theorems, $K$ admits the following eigenvalue decomposition:

$$K(x, x') = \sum_{k \ge 1} \lambda_{jk} \varphi_{jk}(x) \varphi_{jk}(x'),$$

where $\lambda_{j1} \ge \lambda_{j2} \ge \cdots \ge 0$ are its eigenvalues and $\{\varphi_{jk} : k \ge 1\}$ are the corresponding
eigenfunctions such that

$$\langle \varphi_{jk}, \varphi_{jk'} \rangle_{L_2(\Pi_j)} = \delta_{kk'}.$$

Here $\Pi_j$ is the $j$th marginal distribution of $\Pi$, and $\delta_{kk'}$ is the Kronecker delta. It is well
known that the RKHS norm of any $h \in \mathcal{H}(K)$ can be written as

$$\|h\|_K^2 = \sum_{k \ge 1} \frac{1}{\lambda_{jk}} \langle h, \varphi_{jk} \rangle_{L_2(\Pi_j)}^2,$$

which means that the “smoothness” of functions in $\mathcal{H}(K)$ is determined by the rate of
decay of the eigenvalues $\lambda_{jk}$, and the unit balls in the RKHS $\mathcal{H}(K)$ are ellipsoids in the space
$L_2(\Pi_j)$ with “axes” $\sqrt{\lambda_{jk}}$. For example, it is well known that if $\Pi_j$ is the Lebesgue measure
on $[0,1]$, then $\lambda_{jk} \asymp k^{-2\alpha}$ for $W_2^\alpha$.
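To make the ellipsoid picture concrete, the following sketch computes both norms of a function from its coefficients in the eigenbasis. It assumes eigenvalues exactly equal to $k^{-2\alpha}$ (the text only fixes their order) and a finite coefficient vector; the function names are illustrative.

```python
def sobolev_eigenvalues(num_terms, alpha):
    """Hypothetical eigenvalues k^(-2*alpha) for k = 1, ..., num_terms."""
    return [float(k) ** (-2.0 * alpha) for k in range(1, num_terms + 1)]


def l2_norm_squared(coefficients):
    """Squared L2 norm: sum of squared coefficients in the orthonormal eigenbasis."""
    return sum(c * c for c in coefficients)


def rkhs_norm_squared(coefficients, eigenvalues):
    """Squared RKHS norm: each squared coefficient is weighted by 1 / eigenvalue."""
    return sum(c * c / lam for c, lam in zip(coefficients, eigenvalues))


alpha = 1.0
eigenvalues = sobolev_eigenvalues(3, alpha)
coefficients = [1.0, 0.5, 0.0]
print(l2_norm_squared(coefficients))
print(rkhs_norm_squared(coefficients, eigenvalues))
```

High-frequency coefficients are penalized more heavily in the RKHS norm, which is why faster eigenvalue decay corresponds to smoother functions in the unit ball.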


2.2 Complexity of RKHS 

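How well we can recover a function from a particular RKHS is fundamentally related to the
capacity of the unit ball in $\mathcal{H}(K)$:

$$B_1(\mathcal{H}(K)) := \{h \in \mathcal{H}(K) : \|h\|_K \le 1\}.$$

See, e.g., Yang and Barron (1999). In particular, the capacity of $B_1(\mathcal{H}(K))$ can be measured
by its covering number $N(B_1(\mathcal{H}(K)), \delta, \|\cdot\|_\infty)$ where $\|\cdot\|_\infty$ is defined in (5). Recall that
for $\delta > 0$ and a set $\mathcal{F}$ of continuous functions on a metric space $\mathcal{X}$, the covering number
$N(\mathcal{F}, \delta, \|\cdot\|_\infty)$ with respect to the $\|\cdot\|_\infty$ metric is defined as the smallest integer $m$ such
that

$$\mathcal{F} = \bigcup_{i=1}^m \{f \in \mathcal{F} : \|f - f^{(i)}\|_\infty \le \delta\}$$

for some $f^{(1)}, \ldots, f^{(m)} \in \mathcal{F}$. In particular, if $\lambda_{jk} = O(k^{-2\alpha})$ and $\sup_{k \ge 1} \|\varphi_{jk}\|_\infty < \infty$, then

$$\log N(B_1(\mathcal{H}(K)), \delta, \|\cdot\|_\infty) \le c\, \delta^{-1/\alpha}, \qquad \forall \delta > 0, \qquad (6)$$

for some constant $c > 0$. This holds, for example, for Sobolev spaces of order $\alpha$.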

For our purposes, we are also interested in certain data-dependent estimates of the 
complexity of a function class, namely, Rademacher and Gaussian complexities. See, e.g., 
Bartlett and Mendelson (2002). Write 

$$R_{jn}(u) := \sup_{h \in B_1(\mathcal{H}(K)) : \|h\|_{L_2(\Pi_j)} \le u} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i h(X_{ij}) \Big|, \qquad (7)$$

where the $\sigma_i$ are iid Rademacher variables, that is, $P(\sigma_i = 1) = P(\sigma_i = -1) = 1/2$. The following
bound on $R_{jn}$ will be useful in our later analysis.

Lemma 1. Assume that $\lambda_{jk} \le c_1 k^{-2\alpha}$ and $\sup_{k \ge 1} \|\varphi_{jk}\|_\infty \le c_2$ for some constants $c_1, c_2 > 0$.
Then there exists a constant $c > 0$ depending on $\alpha$, $c_1$ and $c_2$ only such that for any
$\beta > 0$, with probability at least $1 - d^{-\beta}$,

$$R_{jn}(u) \le c n^{-1/2} \Big[ u^{1 - \frac{1}{2\alpha}} + u \sqrt{\beta \log d} + e^{-d} \Big]$$

uniformly for all $u \in [0,1]$.

Another quantity of interest to us is the “empirical” Gaussian complexity of the unit
ball in $\mathcal{H}(K)$:

$$Z_{jn}(u) := \sup_{h \in B_1(\mathcal{H}(K)) : \|h\|_{L_2(\Pi_{jn})} \le u} \Big| \frac{1}{n} \sum_{i=1}^n \epsilon_i h(X_{ij}) \Big|, \qquad (8)$$

where $\Pi_{jn}$ is the $j$th marginal of the empirical distribution $\Pi_n$. Similar to Lemma 1, we have
the following bound for $Z_{jn}$.
Lemma 2. Assume that $\lambda_{jk} \le c_1 k^{-2\alpha}$ and $\sup_{k \ge 1} \|\varphi_{jk}\|_\infty \le c_2$ for some constants $c_1, c_2 > 0$.
Then there exists a constant $c > 0$ depending on $\alpha$, $c_1$ and $c_2$ only such that for any
$\beta > 0$, with probability at least $1 - d^{-\beta}$,

$$Z_{jn}(u) \le c n^{-1/2} \Big[ u^{1 - \frac{1}{2\alpha}} + u \sqrt{\beta \log d} + e^{-d} \Big]$$

uniformly for all $u \in [0,1]$.

Both Lemmas 1 and 2 follow from a standard peeling argument (see, e.g., van de Geer, 
2000). We present their proofs in the Appendix for completeness. 

3 Main Results 

In what follows, we shall assume that there exists a constant $\eta_q > 1$ such that

$$\eta_q^{-1} \|g\|_{L_2(\Pi)}^2 \le \sum_{j=1}^d \|g_j\|_{L_2(\Pi_j)}^2 \le \eta_q \|g\|_{L_2(\Pi)}^2 \qquad (9)$$

for any $g \in \mathcal{H}_d$ where

$$g(x_1, \ldots, x_d) = g_1(x_1) + \cdots + g_d(x_d) \quad \text{and} \quad \|g\|_{\ell_q(\mathcal{H}_d)}^q = \sum_{j=1}^d \|g_j\|_{\mathcal{H}_1}^q.$$

Condition (9) is a nonparametric version of the restricted eigenvalue conditions commonly
used in analyzing sparse estimation in high dimensional linear regression; see, e.g., Bickel,
Ritov and Tsybakov (2009). It is worth noting that, unlike the usual restricted
eigenvalue conditions in linear regression, Condition (9) is on the distribution of $X$ rather
than the design matrix, or observations $X_1, \ldots, X_n$. The condition is satisfied in particular
when $\Pi$ is a product measure.

To fix ideas, in the rest of the paper, we shall also assume that there exist a constant
$c_1 > 1$ and a non-increasing sequence of nonnegative numbers $\lambda_1 \ge \lambda_2 \ge \cdots$ such that

$$c_1^{-1} \lambda_k \le \lambda_{jk} \le c_1 \lambda_k, \qquad (10)$$

for all $j = 1, 2, \ldots, d$ and $k \ge 1$. In addition, similar to the treatment of high dimensional
linear models (see, e.g., Raskutti, Wainwright and Yu, 2011), we shall assume in the rest 
of the paper that Co'nV 2 < d < e n for some universal constant Cq > 0 to ensure nontrivial 
probabilistic bounds. This, in particular, is true in high dimensional settings where $n < d < e^n$.

We are now in position to present the main results. We first state a minimax lower 
bound. 


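Theorem 1. Assume that $\lambda_k = k^{-2\alpha}$ for some $\alpha > 1/2$. Under the regression model (1)
where $f \in B_R(\ell_q(\mathcal{H}_d))$ and the covariate $X$ follows a distribution $\Pi$ such that (9) and (10)
hold, and the eigenfunctions $\{\varphi_{jk} : j = 1, \ldots, d,\ k \ge 1\}$ are uniformly bounded, there exists
a constant $c > 0$ depending on $\sigma^2$, $\alpha$, $R$, $c_1$ and $\eta_q$ only such that

$$\lim_{n \to \infty} \inf_{\tilde f} \sup_{f \in B_R(\ell_q(\mathcal{H}_d))} P\Big\{ \|\tilde f - f\|_{L_2(\Pi)}^2 \ge c \Big[ \Big( \frac{\log d}{n} \Big)^{1 - q/2} + n^{-\frac{2\alpha}{2\alpha+1}} \Big] \Big\} > 0.$$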

The lower bound is established via Fano’s Lemma. See, e.g., Cover and Thomas (1991).
We relegate its proof to Section 4. Next, we show that the rate given in the lower bound in
the previous theorem is attainable. In particular, we consider the least squares estimator:

$$\hat f = \operatorname*{argmin}_{g \in B_R(\ell_q(\mathcal{H}_d))} \sum_{i=1}^n [Y_i - g(X_i)]^2. \qquad (11)$$

The next result shows that $\hat f$ is indeed minimax rate optimal.


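Theorem 2. Assume that $\lambda_k = k^{-2\alpha}$ for some $\alpha > 1/2$. Under the regression model (1)
where $f \in B_R(\ell_q(\mathcal{H}_d))$ and the covariate $X$ follows a distribution $\Pi$ such that (9) and (10)
hold, and the eigenfunctions $\{\varphi_{jk} : j = 1, \ldots, d,\ k \ge 1\}$ are uniformly bounded, there exists a
constant $c > 0$ depending on $\sigma^2$, $\alpha$, $R$, $c_1$ and $\eta_q$ only such that for any $\beta > 0$, with probability
at least $1 - d^{-\beta}$,

$$\|\hat f - f\|_{L_2(\Pi)}^2 \le c(\beta + 1) \Big[ \Big( \frac{\log d}{n} \Big)^{1 - q/2} + n^{-\frac{2\alpha}{2\alpha+1}} \Big] \qquad (12)$$

and

$$\|\hat f - f\|_{L_2(\Pi_n)}^2 \le c(\beta + 1) \Big[ \Big( \frac{\log d}{n} \Big)^{1 - q/2} + n^{-\frac{2\alpha}{2\alpha+1}} \Big], \qquad (13)$$

where $\hat f$ is the least squares estimator defined by (11).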
The proof of Theorem 2 is also presented in Section 4. It relies on several basic facts of
empirical process theory such as symmetrization inequalities and contraction inequalities
for Rademacher processes that can be found in the books of Ledoux and Talagrand (1991)
and van der Vaart and Wellner (1996). We also use Talagrand’s concentration inequality for
empirical processes; see, e.g., Talagrand (1996) and Bousquet (2002).

Theorems 1 and 2 together immediately imply that the minimax optimal rate for estimating
$f \in B_R(\ell_q(\mathcal{H}_d))$ is

$$\|\hat f - f\|_{L_2(\Pi)}^2 \asymp \Big( \frac{\log d}{n} \Big)^{1 - q/2} + n^{-\frac{2\alpha}{2\alpha+1}}.$$

This result connects with two strands of literature: estimating high dimensional linear
regression assuming that the coefficient vector belongs to an $\ell_q$ ball, and estimating a high
dimensional additive model assuming that the underlying function comes from an $\ell_0(\mathcal{H}_d)$
ball. In the case of linear regression, it is known that the $\ell_1$ penalty or the Lasso (Tibshirani,
1996) leads to rate optimal estimators under suitable regularity conditions. See, e.g., Ye and
Zhang (2010). A similar phenomenon has also been observed for high dimensional additive
models where it is shown that a mixed $\ell_1$ norm penalty of the form


Ill'll Hi + a n ^ Ibjll^Ohn) 
3 =1 3 =1 


(14) 


can lead to rate optimal estimators with appropriate choices of the tuning parameter $a_n > 0$.
See, e.g., Koltchinskii and Yuan (2010) and Raskutti, Wainwright and Yu (2012). The
use of a mixed $\ell_1$ penalty of the form (14) highlights the difference between linear models
and additive models. When dealing with nonparametric component functions, we need to
penalize both the RKHS norm and the $L_2$ norm: the former ensures smoothness of the estimate,
whereas the latter is needed for thresholding redundant components and hence sparsity.

A natural question is whether or not a similar strategy will lead to minimax rate optimal
estimators under an $\ell_q(\mathcal{H}_d)$ ball for general $0 < q < 1$. Somewhat surprisingly, the answer
appears to be negative in general, and we give here a heuristic argument why. The challenge
occurs in the smooth regime where

$$\alpha < \frac{1}{q} - \frac{1}{2} \quad \text{and} \quad d < \exp\Big( n^{\frac{2}{2-q}\left(\frac{1}{2\alpha+1} - \frac{q}{2}\right)} \Big).$$

Recall that the corresponding minimax optimal rate of convergence in the smooth regime is
given by

$$n^{-\frac{2\alpha}{2\alpha+1}}.$$
As pointed out before, this is the best possible rate of convergence even if there is only one
nonzero component. And to achieve this rate, we need to choose 



(15) 


because, if $a_n$ is smaller, then in the particular case of one nonzero component, the minimax
optimal rate cannot be attained. See, e.g., Tsybakov (2009) or Koltchinskii and Yuan (2010).
Now for a general $f$ from the unit $\ell_q(\mathcal{H}_d)$ ball, we will need a diverging number of nonzero
components to approximate it. More precisely, as we shall show in the proofs, we may need
to estimate up to



nonzero components to balance the approximation error and sparsity. If we choose $a_n$ to
be of the order given by (15), then each component can only be estimated with squared $L_2$
error of the order of



leading to an overall rate of convergence no smaller than, up to a multiplicative constant, 



at least under the assumption that $\Pi$ is a product measure. This rate is obviously suboptimal.
As a result, in the smooth regime, no matter what value $a_n$ takes, we cannot attain the minimax
optimal rate of convergence through a mixed $\ell_1$ penalty of the form (14).

4 Proofs. 

We now prove the main results, Theorems 1 and 2. For brevity, we shall also assume that
$\sigma^2 = 1$ and $R = 1$ in the proofs. The more general case follows from an identical argument with
different constants.

4.1 Lower bounds. 

We establish the lower bound via Fano’s Lemma. To this end, we need to construct a set of
functions

$$\mathcal{G} := \{g^1, \ldots, g^M\} \subset B_1(\ell_q(\mathcal{H}_d))$$
that are sufficiently apart from each other. Let $N$ be a natural number whose value will
be specified later. For a matrix $A \in \{-1, 0, 1\}^{d \times N}$, denote by $s_A$ the number of its nonzero
rows, that is

$$s_A = \mathrm{card}\{i : A_{i\cdot} \ne 0\},$$

where $A_{i\cdot}$ is the $i$th row vector of $A$. Write

$$g_A(x_1, \ldots, x_d) = N^{-1/2} s_A^{-1/q} \sum_{j=1}^d \sum_{k=1}^N a_{jk} \lambda_{j,N+k}^{1/2} \varphi_{j,N+k}(x_j).$$


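It is clear that

$$\|g_A\|_{\ell_q(\mathcal{H}_d)}^q \le N^{-q/2} s_A^{-1} \sum_{j=1}^d \Big\| \sum_{k=1}^N a_{jk} \lambda_{j,N+k}^{1/2} \varphi_{j,N+k} \Big\|_{\mathcal{H}_1}^q = s_A^{-1} \sum_{j=1}^d \Big( N^{-1} \sum_{k=1}^N a_{jk}^2 \Big)^{q/2}.$$

Because $a_{jk}^2 \in \{0, 1\}$, this can be further bounded by

$$\|g_A\|_{\ell_q(\mathcal{H}_d)}^q \le s_A^{-1} \sum_{j=1}^d \mathbf{1}(A_{j\cdot} \ne 0) = 1,$$

which implies that $g_A \in B_1(\ell_q(\mathcal{H}_d))$.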

We now describe how to generate the set $\mathcal{G}$. In particular, we consider functions of
the form $g_A$ with $A \in \{\pm 1, 0\}^{d \times N}$ as described before. We first choose $s$ rows of $A$ to be
nonzero, and set the rest of the rows of $A$ to be zero. The value of $s$ will become clear later.
To this end, we appeal to the Varshamov-Gilbert Lemma, which states that we can find a set
$\{\theta_1, \ldots, \theta_{M_1}\} \subset \{0, 1\}^d$ such that

(a) $\|\theta_k\|_{\ell_0} = s$ for $1 \le k \le M_1$;

(b) for any $k \ne k'$, $\|\theta_k - \theta_{k'}\|_{\ell_0} > s/2$;

(c) $\log M_1 \ge \frac{1}{4} s \log(d/s)$.

See, e.g., Massart (2007). For a given $\theta$, we set to zero the rows of $A$ for which the corresponding
coordinate of $\theta$ is zero. In the next step, we fill in the remaining rows of $A$ with $\pm 1$. Again,
by the Varshamov-Gilbert Lemma, there exists a set $\{\Gamma_1, \ldots, \Gamma_{M_2}\} \subset \{\pm 1\}^{s \times N}$ such that

(a’) for any $k \ne k'$, $\|\Gamma_k - \Gamma_{k'}\|_F^2 \ge Ns/2$;

(b’) $\log M_2 \ge Ns/8$.
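The first packing used above can be illustrated on a small example. The sketch below builds, by a simple greedy search, a set of $s$-sparse binary vectors whose pairwise Hamming distance exceeds $s/2$. The greedy search is an assumption made for illustration; it is not the argument behind the Varshamov-Gilbert Lemma and only works for small $d$, and the function name is hypothetical.

```python
import math
from itertools import combinations


def greedy_sparse_packing(d, s):
    """Greedily collect supports of s-sparse vectors in {0,1}^d with pairwise Hamming distance > s/2."""
    packing = []
    for support in combinations(range(d), s):
        candidate = set(support)
        # Two indicator vectors with supports of size s differ in 2 * (s - overlap) coordinates.
        if all(2 * (s - len(candidate & other)) > s / 2 for other in packing):
            packing.append(candidate)
    return packing


packing = greedy_sparse_packing(10, 4)
print(len(packing), math.log(len(packing)), 0.25 * 4 * math.log(10 / 4))
```

Each element of `packing` is the support of one vector $\theta_k$; the printed values compare the logarithm of the packing size with a bound of the form appearing in property (c).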

For a given $\Gamma$, we shall fill in the nonzero rows of $A$ by $\Gamma$, leading to a collection

$$\mathcal{G} = \{g_{A(\theta_j, \Gamma_k)} : 1 \le j \le M_1,\ 1 \le k \le M_2\},$$

where $A(\theta, \Gamma)$ is a $d \times N$ matrix whose $i$th row is zero if the $i$th entry of $\theta$ is zero, and the
collection of the nonzero rows of $A$ is given by $\Gamma$. In what follows, for brevity, we shall write

$$\mathcal{G} = \{g_{A_k} : 1 \le k \le M\},$$

where $M = M_1 M_2$ and

$$\mathcal{A} = \{A_k : 1 \le k \le M\}$$

is the collection of $d \times N$ matrices of the form $A(\theta_j, \Gamma_k)$. By (c) and (b’),

$$\log M \ge \frac{1}{4} s \log(d/s) + \frac{1}{8} Ns.$$

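Note that, for any two matrices $A, B \in \{-1, 0, 1\}^{d \times N}$ such that $s_A = s_B =: s$, we have

$$\begin{aligned}
\|g_A - g_B\|_{L_2(\Pi)}^2 &= N^{-1} s^{-2/q} \int_{\mathcal{X}^d} \Big( \sum_{j=1}^d \sum_{k=1}^N (a_{jk} - b_{jk}) \lambda_{j,N+k}^{1/2} \varphi_{j,N+k}(x_j) \Big)^2 d\Pi\big((x_1, \ldots, x_d)^\top\big) \\
&\ge \eta_q^{-1} N^{-1} s^{-2/q} \sum_{j=1}^d \Big\| \sum_{k=1}^N (a_{jk} - b_{jk}) \lambda_{j,N+k}^{1/2} \varphi_{j,N+k} \Big\|_{L_2(\Pi_j)}^2 \\
&= \eta_q^{-1} N^{-1} s^{-2/q} \sum_{j=1}^d \sum_{k=1}^N \lambda_{j,N+k} (a_{jk} - b_{jk})^2,
\end{aligned}$$

where the inequality follows from (9). By (10), this can be further lower-bounded by

$$\begin{aligned}
\|g_A - g_B\|_{L_2(\Pi)}^2 &\ge c_1^{-1} \eta_q^{-1} N^{-1} s^{-2/q} \sum_{j=1}^d \sum_{k=1}^N \lambda_{N+k} (a_{jk} - b_{jk})^2 \\
&\ge c_1^{-1} \eta_q^{-1} N^{-1} s^{-2/q} \lambda_{2N} \sum_{j=1}^d \sum_{k=1}^N (a_{jk} - b_{jk})^2 \\
&= c_1^{-1} \eta_q^{-1} 2^{-2\alpha} N^{-1-2\alpha} s^{-2/q} \|A - B\|_F^2.
\end{aligned}$$

By construction, for any $A \ne A' \in \mathcal{A}$,

$$\|A - A'\|_F^2 \ge Ns/2,$$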
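and hence,

$$\|g_A - g_{A'}\|_{L_2(\Pi)}^2 \ge c_1^{-1} \eta_q^{-1} 2^{-1-2\alpha} N^{-2\alpha} s^{1-2/q}.$$

On the other hand, for any $A \in \mathcal{A}$,

$$\begin{aligned}
\|g_A\|_{L_2(\Pi)}^2 &= N^{-1} s^{-2/q} \int_{\mathcal{X}^d} \Big( \sum_{j=1}^d \sum_{k=1}^N a_{jk} \lambda_{j,N+k}^{1/2} \varphi_{j,N+k}(x_j) \Big)^2 d\Pi\big((x_1, \ldots, x_d)^\top\big) \\
&\le \eta_q N^{-1} s^{-2/q} \sum_{j=1}^d \Big\| \sum_{k=1}^N a_{jk} \lambda_{j,N+k}^{1/2} \varphi_{j,N+k} \Big\|_{L_2(\Pi_j)}^2 \\
&= \eta_q N^{-1} s^{-2/q} \sum_{j=1}^d \sum_{k=1}^N \lambda_{j,N+k} a_{jk}^2 \\
&\le c_1 \eta_q N^{-1} s^{-2/q} \sum_{j=1}^d \sum_{k=1}^N \lambda_{N+k} a_{jk}^2 \\
&\le c_1 \eta_q N^{-1} s^{-2/q} \lambda_N \sum_{j=1}^d \sum_{k=1}^N a_{jk}^2 \\
&= c_1 \eta_q N^{-2\alpha} s^{1-2/q}.
\end{aligned}$$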


Following a standard argument, the lower bound can be reduced to the error probability
in a multi-way hypothesis test. See, e.g., Tsybakov (2009). More specifically, let $\Theta$ be a
random variable uniformly distributed on $\{1, \ldots, M\}$. Then it can be deduced that


inf sup P 

f f&Bi(e q (n d )) 


II f ~ /llL(n) 



min 

A^A'eA 


WdA — <M'||i 2 (n) 


> inf P{0 7 ^ 0}, 
© 


where the infimum on the righthand side is taken over all decision rules that are measurable 
functions of the data. By Fano’s Lemma, we get 


$$P\{\hat\Theta \ne \Theta \mid X_1, \ldots, X_n\} \ge 1 - \frac{1}{\log M} \Big[ I_{X_1, \ldots, X_n}(Y_1, \ldots, Y_n; \Theta) + \log 2 \Big], \qquad (16)$$

where $I_{X_1, \ldots, X_n}(Y_1, \ldots, Y_n; \Theta)$ is the mutual information between $\Theta$ and $Y_1, \ldots, Y_n$ with
$X_1, \ldots, X_n$ being held fixed. It is not hard to derive


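$$\begin{aligned}
E_{X_1, \ldots, X_n}\big[ I_{X_1, \ldots, X_n}(Y_1, \ldots, Y_n; \Theta) \big] &\le \binom{M}{2}^{-1} \sum_{A \ne A' \in \mathcal{A}} E_{X_1, \ldots, X_n} \mathcal{K}(P_{g_A} \| P_{g_{A'}}) \\
&\le \frac{n}{2} \binom{M}{2}^{-1} \sum_{A \ne A' \in \mathcal{A}} E_{X_1, \ldots, X_n} \|g_A - g_{A'}\|_{L_2(\Pi_n)}^2,
\end{aligned}$$

where $\mathcal{K}(\cdot \| \cdot)$ denotes the Kullback-Leibler distance, $P_g$ stands for the conditional distribution of
$\{Y_i : 1 \le i \le n\}$ given $\{X_i : 1 \le i \le n\}$ when the true regression function in (1) is given by
$f = g$, and for any $g : \mathcal{X}^d \to \mathbb{R}$,

$$\|g\|_{L_2(\Pi_n)}^2 = \frac{1}{n} \sum_{i=1}^n [g(X_i)]^2.$$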
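Thus,

$$\begin{aligned}
E_{X_1, \ldots, X_n}\big[ I_{X_1, \ldots, X_n}(Y_1, \ldots, Y_n; \Theta) \big] &\le \frac{n}{2} \binom{M}{2}^{-1} \sum_{A \ne A' \in \mathcal{A}} \|g_A - g_{A'}\|_{L_2(\Pi)}^2 \\
&\le \frac{n}{2} \max_{A \ne A' \in \mathcal{A}} \|g_A - g_{A'}\|_{L_2(\Pi)}^2 \\
&\le 2n \max_{A \in \mathcal{A}} \|g_A\|_{L_2(\Pi)}^2 \\
&\le 2 c_1 \eta_q n N^{-2\alpha} s^{1-2/q}.
\end{aligned}$$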



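Now, from (16), we get

$$\begin{aligned}
\inf_{\tilde f} \sup_{f \in B_1(\ell_q(\mathcal{H}_d))} P\Big\{ \|\tilde f - f\|_{L_2(\Pi)}^2 \ge c_1^{-1} \eta_q^{-1} 2^{-2-2\alpha} N^{-2\alpha} s^{1-2/q} \Big\} &\ge \inf_{\hat\Theta} P\{\hat\Theta \ne \Theta\} \\
&\ge 1 - \frac{E_{X_1, \ldots, X_n}\big[ I_{X_1, \ldots, X_n}(Y_1, \ldots, Y_n; \Theta) \big] + \log 2}{\log M} \\
&\ge 1 - \frac{2 c_1 \eta_q n N^{-2\alpha} s^{1-2/q} + \log 2}{\frac{1}{4} s \log(d/s) + \frac{1}{8} Ns}.
\end{aligned}$$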


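Taking $N = 1$ and

$$s = C_1 \Big( \frac{n}{\log d} \Big)^{q/2}$$

for a sufficiently small constant $C_1 > 0$ yields

$$\inf_{\tilde f} \sup_{f \in B_1(\ell_q(\mathcal{H}_d))} P\Big\{ \|\tilde f - f\|_{L_2(\Pi)}^2 \ge C_2 \Big( \frac{\log d}{n} \Big)^{1-q/2} \Big\} \ge 3/4, \qquad (17)$$

for some constant $C_2 > 0$ depending on $\alpha$, $\eta_q$ and $c_1$ only. On the other hand, if $\alpha < 1/q - 1/2$,
taking

$$s = 1 \quad \text{and} \quad N = C_1 n^{\frac{1}{2\alpha+1}}$$

for a sufficiently small constant $C_1 > 0$ yields

$$\inf_{\tilde f} \sup_{f \in B_1(\ell_q(\mathcal{H}_d))} P\Big\{ \|\tilde f - f\|_{L_2(\Pi)}^2 \ge C_2 n^{-\frac{2\alpha}{2\alpha+1}} \Big\} \ge 3/4. \qquad (18)$$

Combining (17) and (18), we have

$$\inf_{\tilde f} \sup_{f \in B_1(\ell_q(\mathcal{H}_d))} P\Big\{ \|\tilde f - f\|_{L_2(\Pi)}^2 \ge C_2 \Big[ \Big( \frac{\log d}{n} \Big)^{1-q/2} + n^{-\frac{2\alpha}{2\alpha+1}} \Big] \Big\} \ge 3/4,$$

which completes the proof.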


4.2 Upper bounds 

We now prove the upper bounds given in Theorem 2. By definition,

$$\frac{1}{n} \sum_{i=1}^n [Y_i - \hat f(X_i)]^2 \le \frac{1}{n} \sum_{i=1}^n [Y_i - f(X_i)]^2,$$

which immediately implies that

$$\frac{1}{n} \sum_{i=1}^n [\hat f(X_i) - f(X_i)]^2 \le \frac{2}{n} \sum_{i=1}^n \epsilon_i [\hat f(X_i) - f(X_i)]. \qquad (19)$$

Write $\Delta_j = \hat f_j - f_j$ and $\Delta = \hat f - f$. It is clear that $\Delta = \sum_{j=1}^d \Delta_j$.

Our main strategy is to derive upper and lower bounds for the right and left hand sides
of (19) respectively, and then put them together to derive (12).
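For completeness, the step from the defining inequality of the least squares estimator to (19) can be written out as follows. This is a routine expansion, sketched here using only $Y_i = f(X_i) + \epsilon_i$ from model (1) and the fact that $f$ itself is feasible in (11).

```latex
\begin{align*}
\sum_{i=1}^n [Y_i - \hat f(X_i)]^2
  &= \sum_{i=1}^n \big[\epsilon_i - (\hat f(X_i) - f(X_i))\big]^2 \\
  &= \sum_{i=1}^n \epsilon_i^2
     - 2\sum_{i=1}^n \epsilon_i [\hat f(X_i) - f(X_i)]
     + \sum_{i=1}^n [\hat f(X_i) - f(X_i)]^2 ,
\qquad
\sum_{i=1}^n [Y_i - f(X_i)]^2 = \sum_{i=1}^n \epsilon_i^2 .
\end{align*}
% Since the first sum is no larger than the last one, cancelling \sum \epsilon_i^2
% and dividing by n gives (19).
```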


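Step 1. Bounding the righthand side of (19). Observe that

$$\Big| \frac{1}{n} \sum_{i=1}^n \epsilon_i \Delta_j(X_{ij}) \Big| \le \|\Delta_j\|_{\mathcal{H}_1} Z_{jn}\Big( \frac{\|\Delta_j\|_{L_2(\Pi_{jn})}}{\|\Delta_j\|_{\mathcal{H}_1}} \Big),$$

where $Z_{jn}$ is defined by (8). By Lemma 2, this can be further bounded by

$$C_1 n^{-1/2} \Big( \|\Delta_j\|_{L_2(\Pi_{jn})}^{1 - \frac{1}{2\alpha}} \|\Delta_j\|_{\mathcal{H}_1}^{\frac{1}{2\alpha}} + \|\Delta_j\|_{L_2(\Pi_{jn})} \sqrt{(\beta + 1) \log d} + e^{-d} \|\Delta_j\|_{\mathcal{H}_1} \Big)$$

for some constant $C_1 > 0$, with probability at least $1 - d^{-(\beta+1)}$. By the union bound, with
probability at least $1 - d^{-\beta}$,

$$\begin{aligned}
\frac{2}{n} \sum_{i=1}^n \epsilon_i [\hat f(X_i) - f(X_i)] &\le 2 \sum_{j=1}^d \Big| \frac{1}{n} \sum_{i=1}^n \epsilon_i \Delta_j(X_{ij}) \Big| \\
&\le 2 C_1 n^{-1/2} \sum_{j=1}^d \|\Delta_j\|_{L_2(\Pi_{jn})}^{1 - \frac{1}{2\alpha}} \|\Delta_j\|_{\mathcal{H}_1}^{\frac{1}{2\alpha}} \\
&\quad + 2 C_1 n^{-1/2} \sqrt{(\beta + 1) \log d} \sum_{j=1}^d \|\Delta_j\|_{L_2(\Pi_{jn})} \\
&\quad + 2 C_1 n^{-1/2} e^{-d} \sum_{j=1}^d \|\Delta_j\|_{\mathcal{H}_1}. \qquad (20)
\end{aligned}$$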
We denote by $\mathcal{E}_1$ the event that the above inequality holds. We now bound the three terms
on the rightmost side separately.

We first derive a bound for

$$n^{-1/2} \sum_{j=1}^d \|\Delta_j\|_{\mathcal{H}_1}^{\frac{1}{2\alpha}} \|\Delta_j\|_{L_2(\Pi_{jn})}^{1 - \frac{1}{2\alpha}}.$$

We treat the cases of $2/(2\alpha + 1) > q$ and $2/(2\alpha + 1) \le q$ separately.

Case 1: $2/(2\alpha + 1) > q$. By Young’s inequality, for a constant $\zeta > 1$ whose value will be
specified later,




4 a 2 a 


n 1/2 \\ A j\\nJ\ A j\\l 2 (u jn ) < C ^W^W'Un^ + C^n ^IIAIIkT 

Note that for any q < q' < 2, 

EiiAjiih < 2(^ii7jii^+Eii/jii^ 


2 

2a + l 


3=1 


Hi 

,3=1 3=1 

\j =1 3=1 

< 4. 


In particular, we get 


Hence, 


E iimist 1 < 4. 

3=1 


E"“ 1 / 2 ll A /ll*J A illim., S r i *l|A,||i l(n „ ) +4C^»-»*. ( 21 ) 

1=1 


Case 2: $2/(2\alpha + 1) \le q$. Write


n 


- 1/2 


E 

3=1 


I A, 


1 

I 2a 

In! 


IIA 


1 1 2a 

3\\L 2 (n jn ) 


1/2 E iaj^iajEL 

t ; ii A jii«i > n_i/2 


+n 


- 1/2 


3 -II 


E 


II 


IIA- 



IIA/IIL 2 (!?,■„)• 
For the first term on the right hand side, by a similar argument as before, we have


n- 112 E IIMIS-.IIMlhi) 

< E iiMiL(n„) + c™»-^ E ii a j list 1 

j: |a J ||* 1 >n-'/ J jillAjll*, >»-■/“ 

< E II^IIL^) + c^«- (1 - f) E HMih 

■? : ll A jllwi> n-1/a r-W&jW-H^n- 1 / 2 

< r^ 1 J2 ii A iilL(n jn) + 4 c^ T ^' (1 ' !) > 

j : II a jIIhi>^ 1/2 

where in the last inequality, we used the fact that 

E HMih < E HMih s 2E (iitii«i + utiiy) < 4 - 

j:\\A 3 \\ ni >n-l/l 3 =1 3 =1 

On the other hand, because 

ll A jllM>b«) < ll A ilUoc < 11 A j 11 Hi i 

we get 

«- i/2 e ii^nliiMihi) ^ " _1/2 e hmik. 

i:|! A jH-Hi< n_1/2 J:|| A jIIki<71 _1/2 

< n- (1 - ,/2) E HMIh 

i : l|Aj II*, <»- 1/3 

< eiimii,, 

j'=i 

< 4n A1-9 / 2 ). 


Thus, 


«' ,/2 EII a jII4"II a jII 

3 = 1 


d 

1—U- 4o! ^\ 

2° , < / 2cVTl \ 

i 2 (n jn ) - <> 






-(!-§) 


( 22 ) 


j=i 


Combining (21) and (22), we get


«- 1/2 V HAjll j-llAjIlEE) <C^E IIMILw.) + 8C^« 


j-1 


4Q -(l-maxif,^}) 


3 =1 


(23) 
By Theorem 4 of Koltchinskii and Yuan (2010), there exists a numerical constant $C_2 > 1$
such that with probability at least $1 - d^{-\beta}$, for all $h \in \mathcal{H}_1$ and $j = 1, \ldots, d$,


and 


II^IU 2 (n i ) < C 2 


\\h\\L 2 (n jri ) < C 2 


II^IU 2 (iV) + n 2 “+! + 


11 ^ 11 z ,2 (n^) + n 2 “+! + 


{P + l)log d\ 


n 




{P + i)iog<A 

n J 


Hi 


Hi 


Denote by $\mathcal{E}_2$ the event that both (24) and (25) hold. Under $\mathcal{E}_2$,

d d r 

< 2 c 2 2 J 2 


/ . lAIILrn,.) ^ 

3 = 1 


lA ,||2 


_J 2 a_ (f3 + 1 ) \Og d\ 

+ n 2 “+! + —- — ' 


3 = 1 
d 


n 




I A,-" 2 


3 »Hi 


< 2 C;E II^IIU) +8CI (n-^h + itt W ) , 

j=l X ' 


where the second inequality follows from the fact that 


V II A jIIhi A 4. 


j=i 


By (9), this implies that 


E IAIlL<n,„) < 2Cf„,||A||l l(n) + 8 C 2 (n~& + . 


Together with (23), we get 


n 


- 1/2 


,1-7 


^ IIAillln ll A illuorrL~) A ^C 2 r] q C 2 “ 1 II A||2 2 (n) 

3 = 1 


-y2 — 4 a _ 2a (/3 + 1) log d\ 


+ 8C2C 2a-l ^77, 2 Q +1 + 

+ 8( ^ n" (1 _maX{ 2 ’ 2JT+T }) . 


n 




(24) 


(25) 


(26) 


The second term on the rightmost side of (20) can also be bounded under the event $\mathcal{E}_2$.
By (25),


ll A jlU 2 (n, n ) < C 2 ^2 ll A jlU 2 (T/) + C 2 

3 =1 
d 

< C 2 ^2 IIAj||L 2 (iij) + 46*2 


3 = 1 


n 2^+1 4- 


n 2 a + l 4 - 


3 = 1 


(£ + 1) log d 


n 


(P + 1) log d 


n 


H A ill«i 

i— 1 


(27) 
where in the second inequality we used the fact that


d d 

E» A > ii«. < E ii a jH«, s 4 - 

3 = 1 i =1 

Write 

d 

£ H A tlU 2 (ih) A £ ll A ilU 2 (n J )+ £ \\ A jh 2 (n,)- 

J ' t:||Adh 2( n,)>V^ ! i:||A i || £2( n i) < V ^ 

The first term can be bounded by the Cauchy-Schwarz inequality:


£ || A jlU 2 (n i ) 

r-\\^\\L 2(nj) >VW 

< fcard/j : ||Aj|| L2(n } > 


log d 


n 


1/2 


\ 


E 


II A /lll 2 (nj) 


ki : ll A hh 2 (n,)> 


/ log d 

2 ( LL j)' V n 


Observe that 


card < j : ||Aj|| L 2(IIj .) > 


| o gd }<(!^)" /2 El|A,||| ( ,<4(!^ 


n 


-q/2 


Thus, 


( l°g d\ 


-9/4 / d 


1/2 


£ ll^Ob) < 4 ^ £||A. 

y||Aqu 2(n .)>V^ Vj - 1 

' log d \ ~ q A 


2 

illL 2 (n i ) 


< l|A||« n) . 


Together with the fact that 


E 


l|Aj||L 2 (IL,) < 

t:||Adh 2( no< % /Sf 


E _ll A 4llL«n J )( k f) 


(l-?)/2 


v 7 j =i 


s 4 (w 


(l-?)/2 


1/2 
we get


^2 ll A ilU 2 (n,) < W q /2 

3 = 1 

In the light of (27), we have 


logd 


n 


-q/A 


ll A IU 2 (n) + 4 


log d \ 2 
n J 


, Ul(n) 


n 


3 =1 


n J 


+4C 2 n 2 “+i 


logd 


n 


+ 8C 2 ^TT( 1 ^) 


1-i 


n 


where we used the fact that $\log d \le n$ and $C_2 > 1$.
Combining (20), (26), (29) and the fact that

d 

Y. ll^ilki < 4, 


3 =1 


we get 


(28) 


(29) 


- £> [7(*0 -/(*)] S ^r^llAlliin, 


2—1 


„ ._ 4a_ ( _2o^ ({3 + 1) log d 

+ C 3 ( 2 “-! ( Tl 2 «+l + 


n 


+C , 3^ 2 ^T n -( 1 - max f|> 2 jiTT}) 

+C 3V ^TTn-^\l^ 

V n 


n 

1-2 


+C3V ^I(^) -* 


+C , 3?Z 


-1/2 -d 


II A IU 2 (n) 


(30) 


for some constant $C_3 > 0$, under the event $\mathcal{E}_1 \cap \mathcal{E}_2$.


Step 2. Bounding the left-hand side of (19). To bound the left-hand side of (19), first
observe that

$$\|\Delta\|^2_{L_2(\Pi)} - \|\Delta\|^2_{L_2(\Pi_n)} \le \sup_{\substack{g \in B_4(\ell_q(\mathcal{H}^d)) \\ \|g\|_{L_2(\Pi)} \le \|\Delta\|_{L_2(\Pi)}}} \left( \|g\|^2_{L_2(\Pi)} - \|g\|^2_{L_2(\Pi_n)} \right). \tag{31}$$

Note that for any $g \in B_4(\ell_q(\mathcal{H}^d))$,


- IMI h(H d ) E 


\\ g \\ 9 


e q (u d ) 


< 16, 


and 


$$\|g^2\|^2_{L_2(\Pi)} \le \|g\|^2_{L_\infty} \|g\|^2_{L_2(\Pi)} \le 16 \|g\|^2_{L_2(\Pi)}.$$

By Talagrand’s concentration inequality, for any fixed $u \in [0,1]$,


< 2 


sup (IMiLcn) - IMlLqiJ 

g£B 4 {t q {U d )) 

Il9lh 2 (n)<« 

E sup (WlLm-llsIlLnJ+te^+P* 

, 9 eB 4 (e q (n d )) V n n 

\ IMh 2 (n)<« / 


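with probability at least $1 - e^{-t}$. By the symmetrization inequality,

$$\mathbb{E} \sup_{\substack{g \in B_4(\ell_q(\mathcal{H}^d)) \\ \|g\|_{L_2(\Pi)} \le u}} \left( \|g\|^2_{L_2(\Pi)} - \|g\|^2_{L_2(\Pi_n)} \right) \le 2\, \mathbb{E} \sup_{\substack{g \in B_4(\ell_q(\mathcal{H}^d)) \\ \|g\|_{L_2(\Pi)} \le u}} \left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i g^2(X_i) \right|.$$

Note that $g^2$ is an 8-Lipschitz function on $B_4(\ell_q(\mathcal{H}^d))$. By the contraction inequality,

$$\mathbb{E} \sup_{\substack{g \in B_4(\ell_q(\mathcal{H}^d)) \\ \|g\|_{L_2(\Pi)} \le u}} \left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i g^2(X_i) \right| \le 8\, \mathbb{E} \sup_{\substack{g \in B_4(\ell_q(\mathcal{H}^d)) \\ \|g\|_{L_2(\Pi)} \le u}} \left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i g(X_i) \right|.$$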
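The Lipschitz constant 8 used in the contraction step is what one would expect if the functions in the class are bounded by 4 in supremum norm. The sketch below spells this out under that assumption; the bound $|a|, |b| \le 4$ is an assumption inferred from the constant, and $a$, $b$ are illustrative placeholders for values $g(x)$, $g'(x)$ of two functions in the class.

```latex
% Assumption (inferred, not stated in this section): |a| <= 4 and |b| <= 4.
\begin{align*}
|a^2 - b^2| = |a + b|\,|a - b| \le \left( |a| + |b| \right) |a - b| \le 8\, |a - b| .
\end{align*}
% Hence the map t -> t^2 is 8-Lipschitz on [-4, 4], which is the property
% required to pass from \sigma_i g^2(X_i) to \sigma_i g(X_i) at the cost of a factor 8.
```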

Again by Talagrand’s concentration inequality, there exists a numerical constant $C_4 > 0$
such that with probability at least $1 - e^{-t}$,


< 


< 


E sup -V(Tig(Ij) 
gGB 4 (e q (H d )) \ n ~i 
llffllz, 2 ( n )— u 


/ 


C 4 


sup 

l g£B 4 (e q (H d )) 
\ llsllz, 2 (n)<« 




A 


n J 
In other words,


sup (IMlL(ii) - IMlL(n„)) 

g&B4{i q (H d )) 

Il9lh 2 ( n )-' u 

' / 

d 


< 16C 4 




sup 


E U Ushlli, < 4 3=1 V <=i 
E?=l 9j I 




t t 


"•Ml 

i 2 (n) 


<u 


/ 


with probability at least $1 - 2e^{-t}$.
Note that 


( 32 ) 


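$$\left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i g_j(x_{ij}) \right| \le \|g_j\|_{\mathcal{H}_1} \sup_{\substack{\|h\|_{\mathcal{H}_1} = 1 \\ \|h\|_{L_2(\Pi_j)} \le \|g_j\|_{L_2(\Pi_j)} / \|g_j\|_{\mathcal{H}_1}}} \left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(x_{ij}) \right|.$$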


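By Lemma 2 and the union bound, there exists a constant $C_5 > 0$ such that

$$\sup_{\substack{\|h\|_{\mathcal{H}_1} = 1 \\ \|h\|_{L_2(\Pi_j)} \le u}} \left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(x_{ij}) \right| \le C_5 n^{-1/2} \left( u^{1 - \frac{1}{2\alpha}} + u \sqrt{(\beta+1) \log d} + e^{-d} \right)$$

uniformly over $u \in [0,1]$ and $j = 1, \ldots, d$ with probability at least $1 - d^{-\beta}$. Denote this
event by $\mathcal{E}_3$, and we shall now proceed conditional on $\mathcal{E}_3$.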

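It is not hard to see that, under $\mathcal{E}_3$,

$$\sum_{j=1}^{d} \left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i g_j(x_{ij}) \right| \le C_5 n^{-1/2} \sum_{j=1}^{d} \left( \|g_j\|_{\mathcal{H}_1}^{\frac{1}{2\alpha}} \|g_j\|_{L_2(\Pi_j)}^{1 - \frac{1}{2\alpha}} + \|g_j\|_{L_2(\Pi_j)} \sqrt{(\beta+1) \log d} + e^{-d} \|g_j\|_{\mathcal{H}_1} \right). \tag{33}$$

Following the same argument as that for (23), it can be derived that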


n 


-1/2 


sup YMhM 


Ej=i j =1 


jlli 2 (nj) 


I 5^? 1 9i II 


_ 4a 

< C 2 “ _1 sup 

Ej=i llsjlllt,^ 4 3=1 


+ (1 max ^’2 Q+1 » 


|Ej=i 9i || i2 (n)— u 


< £ z^rjqU 2 + 8C 2ct +in (1 max l2>2«+i}). 


(34) 
Similar to (28), it can also be shown that for any $g_1, \ldots, g_d$ such that


d 

and 

3 =1 


d 

lbjlU 2 (n J ) < U, 

3 =1 


we have 


EtellMn J ,<4^( 1 f i )"’ /4 M + 4( k f) 


1 -g 

2 


Combining (33), (34) and (35), we have 


sup 


Y, ( n Y a ^ x ^ 


3= 1 V'" i=i 

|E.i=i9j|| £2(n) <“ 


< C5C ^ijqU 2 + 8C' 5 C 2 “+ 1 n (1 max f2’ 2a +ib 


+4 cJ ^ + 1)loid [> (!^V ,/4 m+ (!^) 


1=2 > 


n 


-\-C 5 n 


-1/2 -d 


Together with (32), conditional on $\mathcal{E}_3$,


SU P (IMIi 2 (n) _ IMIi 2 (n„)) 

<?eS 4 (UUU)) 

l!9lh 2 (n)<“ 

< CeC^VgU 2 + C , 6 C^Tn _(1 ' max{ 2’2STrb 


1 C °\! ^ + ^ lQg d l '7 1/2 ( l ° gd \ ^ - 1 2? 


n 


M+ lirJ 


+C 6 n 1/2 e d + Cg + 


t t 
n 


( 35 ) 


holds for some constant $C_6 > 0$, with probability at least $1 - 2e^{-t}$. Using a peeling argument
similar to that for Lemma 1, we can make this bound uniform over $u \in [0,1]$. More
specifically, it can be shown that there exists a constant $C_7 > 0$ such that, conditional on $\mathcal{E}_3$,


SU P (IMlL(n) - IMIi 2 (n n )) 

geBi{l q (H d )) 

Il9lh 2 ( n )-' u 

< C 6 ^^=^r] q u 2 + C 6 C2^in _(1 ' max{ 2’2 ^ttD 


K 2 ( 


n 


1 -9' 


/logcA q,/4 /log d\ 2 

“ + br 


+C 6 n 


-1/2 -d 


+ C 7 I U \j ^ + 1 ^ >logc? | ^ + 1 ) 1 °S rf 


n 


n 


(36) 


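uniformly over all $u \in [0,1]$ with probability at least $1 - d^{-\beta}$. Denote by $\mathcal{E}_4$ the event that
inequality (36) holds. Then

$$\mathbb{P}\{\mathcal{E}_4\} \ge \mathbb{P}\{\mathcal{E}_4 \mid \mathcal{E}_3\}\, \mathbb{P}(\mathcal{E}_3) \ge (1 - d^{-\beta})^2 \ge 1 - 2 d^{-\beta}.$$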
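The last inequality in this chain is elementary. For completeness, a one-line verification is sketched below, writing $x = d^{-\beta}$ (the shorthand $x$ is introduced here only for illustration).

```latex
% Shorthand (illustrative): x = d^{-\beta}, so 0 < x <= 1 whenever d >= 1 and \beta > 0.
\begin{align*}
(1 - x)^2 = 1 - 2x + x^2 \ge 1 - 2x ,
\qquad \text{so} \qquad
(1 - d^{-\beta})^2 \ge 1 - 2 d^{-\beta} .
\end{align*}
% The first inequality in the chain uses
% P(E_4) >= P(E_4 \cap E_3) = P(E_4 | E_3) P(E_3).
```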


Together with (31), we get, under the event $\mathcal{E}_4$,


l|A|| 


2 

l 2 (u) 


< l|A||i 2 (nn) +C 8 C ^VqU 2 + C 8 C^n (1 max{ 2 ’ 2 “+i» 

, |A|k(n) + 


pCgU 


-1/2 -d 


+C,|M/ (/j + 1)1 ° S<j + (/i + 1)1 ° gd 


n 


n 



(37) 


for some constant $C_8 > 0$.

Step 3. Putting it together. Combining (30) and (37), we get


|A||l 2 (ii) < C 9 r] q ( 2 “ 1 ||A||2 2 (n) 

4 c 3 + 1) logd 


+C 9 C 


n 


+C 9 (^in _(1_m “ { ^ 1) 

,_ /l n p. r J\ !/2-g/4 

+Cgy/P + 1 (~^~J ll A IU 2 (n) 


+C 9 y//3 + In 2 «+m I 


+c 9 (/? + i)( logd 

-\~Cg 7 X 


log d 


n 


n 

<3 


- 1/2 -d 


for some constant $C_9 > 0$, under the event $\mathcal{E}_1 \cap \mathcal{E}_2 \cap \mathcal{E}_4$.
Take $\zeta$ large enough so that

C 9Vq C^ < 1 / 2 . 

Then 


|^||| 2 (n) < 2 C 9 


^__ja_ (^+l)logd 

n 

+2C 9 ( 2 ^+t n -(i-max{ f. 2^+r 1) 
+2C,^Tl (^y~ 9/4 II^IU 2 (n) 

+2C 9 v/dTTn _ 2k7r . 

V n 

+ 2 C 9 V ^Tl(!^) 

+ 2 C' 9 n- 1 / 2 e - d . 


1-9 

2 


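Therefore, there exists a constant $C_{10} > 0$ such that, under the event $\mathcal{E}_1 \cap \mathcal{E}_2 \cap \mathcal{E}_4$,

$$\|\Delta\|^2_{L_2(\Pi)} \le C_{10} (\beta + 1) \left( n^{-\frac{2\alpha}{2\alpha+1}} + \left( \frac{\log d}{n} \right)^{1 - \frac{q}{2}} + \left( \frac{\log d}{n} \right)^{\frac{1}{2} - \frac{q}{4}} \|\Delta\|_{L_2(\Pi)} \right),$$

which implies the bound in (12) on this event. Statement (12) now follows from the fact that

$$\mathbb{P}\{\mathcal{E}_1 \cap \mathcal{E}_2 \cap \mathcal{E}_4\} \ge 1 - \mathbb{P}\{\mathcal{E}_1^c\} - \mathbb{P}\{\mathcal{E}_2^c\} - \mathbb{P}\{\mathcal{E}_4^c\} \ge 1 - 4 d^{-\beta},$$

and an appropriate rescaling of the constants.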
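The passage from the self-bounding inequality to a rate for $\|\Delta\|^2_{L_2(\Pi)}$ is a standard quadratic argument. The sketch below assumes the final display of Step 3 has the form $x^2 \le A + Bx$ with $x = \|\Delta\|_{L_2(\Pi)}$; the shorthands $x$, $A$ and $B$ are introduced here for illustration and are not the paper's notation.

```latex
% Illustrative shorthands (assumptions about how the last display is read):
%   x = \|\Delta\|_{L_2(\Pi)},
%   A = C_{10}(\beta+1)\bigl( n^{-2\alpha/(2\alpha+1)} + (\log d / n)^{1-q/2} \bigr),
%   B = C_{10}(\beta+1)(\log d / n)^{1/2 - q/4}.
\begin{align*}
x^2 \le A + B x
  &\;\Longrightarrow\; x \le \frac{B + \sqrt{B^2 + 4A}}{2} \\
  &\;\Longrightarrow\; x^2 \le \frac{\bigl( B + \sqrt{B^2 + 4A} \bigr)^2}{4}
     \le \frac{2B^2 + 2(B^2 + 4A)}{4} = B^2 + 2A .
\end{align*}
% Since B^2 = C_{10}^2 (\beta+1)^2 (\log d / n)^{1 - q/2}, this gives
\begin{align*}
\|\Delta\|^2_{L_2(\Pi)}
  \le 2 C_{10} (\beta+1) \left( n^{-\frac{2\alpha}{2\alpha+1}} + \left( \frac{\log d}{n} \right)^{1 - \frac{q}{2}} \right)
    + C_{10}^2 (\beta+1)^2 \left( \frac{\log d}{n} \right)^{1 - \frac{q}{2}} .
\end{align*}
```

The second implication uses $(a+b)^2 \le 2a^2 + 2b^2$. The resulting bound contains only the two rate terms $n^{-2\alpha/(2\alpha+1)}$ and $(\log d/n)^{1-q/2}$, with constants depending on $\beta$.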

To show (13), we first derive, via an identical argument to Step 2, that 


A 


2 


l 2 ( n,o 


< H^HL(n)- 


+C*ii 



{P + 1) l°g d f f logd\ 9/4 u + 2 / log d 


n 


2 ’ 2a + l 


}) 


+ C ll n- 1 / 2 e~ d 


-\-Cii 



{P + 1) log d (/3 + 1) log d 


n 


n 


(38) 


for some constant $C_{11} > 0$. Together with (12), this implies (13).

Appendix A — Proof of Lemma 1

An application of Talagrand’s concentration inequality yields, with probability at least 1 — 



It is well known that there exists a numerical constant $C_1 > 0$ such that

$$\mathbb{E} R_{jn}(u) \le \left\{ \mathbb{E} [R_{jn}(u)]^2 \right\}^{1/2} \le C_1 n^{-1/2} u^{1 - \frac{1}{2\alpha}}.$$


See, e.g., Mendelson (2002) or Koltchinskii (2011). In other words, with probability at least 
for some numerical constant $C_2 > 0$. We now make this inequality uniform over $u \in [0,1]$
via a peeling argument.

In particular, with probability at least $1 - \exp(-\beta \log d - 2 \log j)$ for some constant $\beta > 0$,


sup 

2^<||/*lh 2 (np<2- J ' +1 



< Rjn( 2- J+1 ) 


< C 2 n ~ x ' 2 


( 2 -l+ 1 ) 1 -^ + 2- j+1 (/3 logd + 2 log j ) 1 ' 2 


+n 1//2 (/3 logd + 2 log j) 


By the union bound, there exists a constant $C_3 > 0$ such that

Rjniu) < C 3 n ~ 1/2 ^u l ~ ^ + u\/ (3 log d + , 

holds for any $u \in (e^{-d(2\alpha/(2\alpha-1))}, 1]$, with probability at least

$$1 - \sum_{j=1}^{\lceil 2\alpha d \log_2 e / (2\alpha - 1) \rceil} \exp(-\beta \log d - 2 \log j) \ge 1 - 2 d^{-\beta}.$$
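Two small computations sit behind this union bound: the number of dyadic shells needed to reach the lower endpoint, and the summability of the failure probabilities. A sketch of both follows; the shorthand $J$ for the number of shells is introduced here for illustration.

```latex
% Illustrative shorthand: J = number of dyadic shells (2^{-j}, 2^{-j+1}], j = 1, ..., J.
% The shells cover (e^{-d \cdot 2\alpha/(2\alpha-1)}, 1] as soon as
\begin{align*}
2^{-J} \le e^{-d \frac{2\alpha}{2\alpha - 1}}
  \;\Longleftrightarrow\;
  J \ge \frac{2\alpha d}{2\alpha - 1} \log_2 e ,
\end{align*}
% which explains the upper limit of the sum. For the probabilities,
\begin{align*}
\sum_{j=1}^{J} \exp(-\beta \log d - 2 \log j)
  = d^{-\beta} \sum_{j=1}^{J} \frac{1}{j^2}
  \le d^{-\beta} \sum_{j=1}^{\infty} \frac{1}{j^2}
  = \frac{\pi^2}{6}\, d^{-\beta}
  < 2 d^{-\beta} .
\end{align*}
```

The extra $2 \log j$ in the exponent is what makes the failure probabilities summable over shells, so the total does not grow with the number of shells.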


On the other hand, when $u \le e^{-d(2\alpha/(2\alpha-1))}$,

< C 2 n-^ fe- + e-‘< 2 “/< 2 ”-'» v ^toirf + 

< 2 C 2 n -/ 2 (e-+^), 

with probability at least $1 - d^{-\beta}$, for sufficiently large $d$. In summary, there exists a constant
$C_4 > 0$ such that

Rj n {u ) < C 4 n ~ 1/2 + -u^logd + ^~^=— + , 

uniformly over all $u \in [0,1]$ with probability at least $1 - 3 d^{-\beta}$.


Appendix B — Proof of Lemma 2 


Note that 


[l0gW(fil(^l),5, || ■ Hz^)] 172 ^ < C a S 1 2a. 
Therefore, there exist constants $C_1, C_2 > 0$ such that for any fixed $u \in [0,1]$

P | Zj n (u) < C\rT x ^ 2 (^u x ~^ + ut 1 ^ 2 ^ | < C 2 exp [— (tC 1//o! + t)] . 

See, e.g., van de Geer (2000; Corollary 8.3). The rest of the proof follows a similar peeling 
argument as that for Lemma 1 and is omitted for brevity. 